PARQUET-2415: Reuse hadoop file status and footer in ParquetRecordReader #1242

Open: wants to merge 9 commits into base: master
Conversation

@wankunde wankunde commented Dec 21, 2023

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
  • Call ParquetInputSplit.setFooter(footer) before creating a ParquetRecordReader from the split; the reader then reuses the Hadoop file status and skips both the getfileinfo Hadoop RPC and re-reading the footer.
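The usage described above can be sketched as follows. Only `ParquetInputSplit.setFooter(footer)` comes from this PR; the surrounding scaffolding (`ParquetFileReader.readFooter`, the `ParquetRecordReader` constructor and `initialize`, and the `conf`, `readSupport`, and `context` objects) is standard Hadoop/Parquet API shown purely for illustration, and this snippet is not self-contained (it needs parquet-hadoop and hadoop-client on the classpath):

```
// Illustrative sketch, assuming the footer has already been read once
// (e.g. during split planning) via the standard footer-reading API.
ParquetMetadata footer = ParquetFileReader.readFooter(conf, filePath);

// Attach the cached footer to the split before handing it to the reader.
// With this PR, the reader reuses the stored file status and footer
// instead of issuing another getfileinfo RPC and re-reading the footer.
split.setFooter(footer);

ParquetRecordReader<Group> reader = new ParquetRecordReader<>(readSupport);
reader.initialize(split, context); // footer reuse happens here
```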

@wankunde (Author)

I cannot reproduce the failing UT:

2023-12-22T04:25:27.6546055Z [INFO] Running org.apache.parquet.cli.commands.ShowFooterCommandTest
2023-12-22T04:25:27.8426114Z [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.179 s <<< FAILURE! - in org.apache.parquet.cli.commands.ShowFooterCommandTest
2023-12-22T04:25:27.8428914Z [ERROR] testShowDirectoryCommand(org.apache.parquet.cli.commands.ShowFooterCommandTest)  Time elapsed: 0.179 s  <<< ERROR!
2023-12-22T04:25:27.8699007Z com.fasterxml.jackson.databind.JsonMappingException: Document nesting depth (1001) exceeds the maximum allowed (1000, from `StreamWriteConstraints.getMaxNestingDepth()`) (through reference chain: org.apache.parquet.hadoop.util.HadoopInputFile["fs"]->org.apache.hadoop.fs.LocalFileSystem["key"]->org.apache.hadoop.fs.FileSystem$Cache$Key["ugi"]->org.apache.hadoop.security.UserGroupInformation["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]-> ... [the cycle Subject["principals"] -> SynchronizedSet[5] -> User["login"] -> LoginContext["subject"] -> Subject repeats until the nesting-depth limit is hit; output truncated])
ions$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop
.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.au
th.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.se
curity.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.ut
il.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apa
che.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.s
ecurity.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"]->javax.security.auth.Subject["principals"]->java.util.Collections$SynchronizedSet[5]->org.apache.hadoop.security.User["login"]->javax.security.auth.login.LoginContext["subject"])
2023-12-22T04:25:27.8914504Z Caused by: com.fasterxml.jackson.core.exc.StreamConstraintsException: Document nesting depth (1001) exceeds the maximum allowed (1000, from `StreamWriteConstraints.getMaxNestingDepth()`)
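For what it's worth, the trace is a serialization cycle (User -> LoginContext -> Subject -> principals -> User -> ...), so Jackson keeps descending until its default nesting cap trips. A minimal, hypothetical reproduction of that failure mode (class names are illustrative, not from the Parquet code):

```java
import com.fasterxml.jackson.databind.ObjectMapper;

// Two objects referencing each other, like the User/LoginContext/Subject cycle above.
class Node {
  public Node next;
}

public class CycleDemo {
  public static void main(String[] args) {
    Node a = new Node();
    Node b = new Node();
    a.next = b;
    b.next = a; // indirect cycle: the serializer keeps descending until the depth limit trips
    try {
      new ObjectMapper().writeValueAsString(a);
    } catch (Exception e) {
      // With jackson-core 2.15+, this surfaces with a StreamConstraintsException cause
      // ("Document nesting depth (1001) exceeds the maximum allowed (1000, ...)").
      System.out.println(e.getMessage());
    }
  }
}
```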
2023-12-22T04:25:27.8916205Z 

Could anyone help check this issue?

@wgtmac
Member

wgtmac commented Dec 24, 2023

Have you tried running mvn install before running the cli test? Otherwise the test may run with the dependency from Maven Central, without your patch.

@wankunde
Author

Hi @wgtmac, I have fixed this issue in the cli module. Thanks!

@@ -84,6 +86,9 @@ public static ParquetMetadata fromJSON(String json) {
private final FileMetaData fileMetaData;
private final List<BlockMetaData> blocks;

@JsonIgnore
Member

Why is this annotation required?

Author

With this annotation, the Jackson mapper will not serialize this field to JSON, keeping the same behavior as before.
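A minimal sketch of the effect (a hypothetical class mirroring the pattern, not the actual ParquetMetadata): @JsonIgnore keeps the annotated field out of the serialized form while everything else is unchanged.

```java
import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical class following the same pattern as ParquetMetadata.
class Metadata {
  public String name = "footer";

  @JsonIgnore // excluded from JSON output, so the serialized form stays unchanged
  public Object inputFile = new Object();
}

public class JsonIgnoreDemo {
  public static void main(String[] args) throws Exception {
    String json = new ObjectMapper().writeValueAsString(new Metadata());
    System.out.println(json); // inputFile is omitted from the output
  }
}
```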

HadoopInputFile inputFile;
if (split.getFooter() != null
&& split.getFooter().getInputFile() != null
&& split.getFooter().getInputFile() instanceof HadoopInputFile) {
Member

cc @amousavigourabi to see if there is any chance to apply this to the non-hadoop code path

Contributor

Hi @wgtmac, given that the inputFile variable seems to only be used in a constructor expecting an InputFile and not necessarily a HadoopInputFile, I think this instanceof condition could be dropped. As I just quickly skimmed it now and might have missed something, I'll take a more thorough look after Boxing Day. Happy holidays!🎄🎆

Author

@wankunde wankunde Dec 26, 2023

Thanks @wgtmac @amousavigourabi for your review. I have changed HadoopInputFile to InputFile.

Contributor

Looking good! Going back to @wgtmac's earlier concern, the method this snippet is part of is already in the Hadoop code path and I'm not sure whether there is a more generic alternative available. For the rest, the switch to using the plain InputFile interface here is of course amazing for flexibility in the future and makes the code a bit cleaner. Thanks a lot for the swift fix @wankunde!

@wankunde
Author

wankunde commented Jan 3, 2024

Hi @wgtmac @amousavigourabi, are there any concerns about this PR?

Member

@wgtmac wgtmac left a comment

+1 on my side.

I think this should be a very common requirement, and I'm not sure whether the community has discussed it before. cc @gszadovszky @shangxinli @Fokko @ConeyLiu

@amousavigourabi
Contributor

Hi @wankunde, sorry for the delayed response. I don't see any blockers on my side and love the patch, so it's a +1 from me.

@@ -95,6 +95,11 @@
<artifactId>jackson-databind</artifactId>
<version>${jackson-databind.version}</version>
</dependency>
<dependency>
<groupId>${jackson.groupId}</groupId>
Contributor

Can we avoid adding a dependency?

Author

The jackson-annotations dependency is used in parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java so that the InputFile inputFile field is not serialized to JSON, keeping the same behavior as before. I'm sorry, I'm not familiar with the Jackson library and I'm not sure whether there is another way to do this.

Member

I happened to find that we have a parquet-jackson module which shades jackson-core and jackson-databind. But parquet-hadoop (and other modules) also explicitly depends on both parquet-jackson and jackson-xxx at the same time. I'm not familiar with the history; do you know why? @gszadovszky @Fokko @shangxinli

Contributor

@wgtmac, the README of parquet-jackson describes how it works. This is only for doing the shading once (and having one shaded jar) instead of in every module that requires Jackson.

Member

Thanks! Sorry for missing that.

return footer;
}

public void setFooter(ParquetMetadata footer) {
Contributor

@ConeyLiu ConeyLiu Jan 15, 2024

The ParquetInputSplit is marked as deprecated, and the recommended replacement is FileSplit. How would Spark set the footer after ParquetInputSplit is removed?

Author

Currently, Parquet converts the input split to a ParquetInputSplit and builds the reader with it. If ParquetInputSplit were removed from the ParquetFileReader class, Spark would need a shim class to work with different Parquet versions.

That would be a big change.
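For context, the intended caller-side flow per this PR is to attach the footer to the split before building the reader. A rough sketch, assuming the setFooter API added here and the existing (deprecated) ParquetInputSplit constructor; variable names are illustrative:

```java
// Sketch only: `footer` would typically come from planning, where it was already read once.
ParquetMetadata footer = ParquetFileReader.readFooter(configuration, path);

ParquetInputSplit split = new ParquetInputSplit(path, start, end, length, hosts, rowGroupOffsets);
split.setFooter(footer); // new in this PR: the reader reuses the file status and skips re-reading the footer

ParquetRecordReader<Group> reader = new ParquetRecordReader<>(readSupport);
reader.initialize(split, taskAttemptContext); // no extra getfileinfo RPC or footer read
```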

Member

I think this PR can be a good reason to push the Spark community to migrate. Or should we fix this only in Spark 4.x?

Author

I have already filed a WIP ticket, apache/spark#44853, for Spark 4 and will discuss this change on the Spark side in that PR after this PR is merged.

Member

IIUC, other comments have suggested that we should not work on a deprecated interface. Therefore I don't expect this PR to be merged as is. It would be good to figure out the final solution on the Spark side before any action here.

@@ -64,13 +65,19 @@ public int run() throws IOException {
return 0;
}

abstract class MixIn {
@JsonIgnore
abstract int getInputFile();
Contributor

Author

#1242 (comment)

Sorry, the UT failed and I don't know why.

Member

You mean this is a workaround to get rid of the test failure at the cost of a new dependency?

Author

I mean the UT will fail if we just annotate the getInputFile method, so I created a MixIn class here (in the parquet-cli module) as a workaround.

The Parquet project already has a dependency on the jackson-annotations library in some other modules, so I don't think this PR adds a new dependency to the parquet-hadoop module.
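For reference, a Jackson mix-in attaches annotations to a class without modifying it. A minimal sketch of the mechanism (the Target class is hypothetical; addMixIn and @JsonIgnore are standard Jackson API, and the MixIn shape mirrors the diff above):

```java
import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical target class we cannot (or do not want to) annotate directly.
class Target {
  public String name = "footer";
  public Object inputFile = new Object();
}

// Mix-in: annotations declared here are applied to Target by the mapper.
abstract class TargetMixIn {
  @JsonIgnore
  abstract Object getInputFile();
}

public class MixInDemo {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    mapper.addMixIn(Target.class, TargetMixIn.class);
    // The inputFile property is skipped during serialization.
    System.out.println(mapper.writeValueAsString(new Target()));
  }
}
```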


@ConeyLiu
Contributor

Thanks @wankunde for the contribution. And thanks @wgtmac for pinging me.

@gszadovszky
Contributor

+1 for the concept. We need to address that ParquetInputSplit is deprecated. Not sure how, though.

if (split.getFooter() != null && split.getFooter().getInputFile() != null) {
inputFile = split.getFooter().getInputFile();
} else {
inputFile = HadoopInputFile.fromPath(path, configuration);
Contributor

If the FileStatus (or at least the file length) can get down here, then it becomes possible to skip a HEAD request when opening a file against cloud storage. The API you need is in Hadoop 3.3.0 and is not very reflection friendly; we could add something to assist there.

What is key is: get as much info as possible into HadoopInputFile, especially the expected length.
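Agreed; for reference, HadoopInputFile already has a fromStatus factory, so when a FileStatus is available the known length can be reused instead of triggering another getFileStatus (or a HEAD against cloud storage). A rough sketch — cachedStatus is hypothetical plumbing, not something the current patch threads through:

```java
// Sketch: prefer an already-known FileStatus; fall back to the path-based factory,
// which issues a getFileStatus call under the hood.
HadoopInputFile inputFile;
if (cachedStatus != null) {
  inputFile = HadoopInputFile.fromStatus(cachedStatus, configuration); // reuses the length, no extra RPC
} else {
  inputFile = HadoopInputFile.fromPath(path, configuration);
}
```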
